{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Model evaluation using cross-validation\n",
    "\n",
    "In this notebook, we still use numerical features only.\n",
    "\n",
    "Here we discuss the practical aspects of assessing the generalization\n",
    "performance of our model via **cross-validation** instead of a single\n",
    "train-test split.\n",
    "\n",
    "## Data preparation\n",
    "\n",
    "First, let's load the full adult census dataset."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "adult_census = pd.read_csv(\"../datasets/adult-census.csv\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We now drop the target from the data we will use to train our predictive\n",
    "model."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "target_name = \"class\"\n",
    "target = adult_census[target_name]\n",
    "data = adult_census.drop(columns=target_name)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Then, we select only the numerical columns, as seen in the previous\n",
    "notebook."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "numerical_columns = [\"age\", \"capital-gain\", \"capital-loss\", \"hours-per-week\"]\n",
    "\n",
    "data_numeric = data[numerical_columns]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can now create a model using the `make_pipeline` tool to chain the\n",
    "preprocessing and the estimator in every iteration of the cross-validation."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.preprocessing import StandardScaler\n",
    "from sklearn.linear_model import LogisticRegression\n",
    "from sklearn.pipeline import make_pipeline\n",
    "\n",
    "model = make_pipeline(StandardScaler(), LogisticRegression())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## The need for cross-validation\n",
    "\n",
    "In the previous notebook, we split the original data into a training set and a\n",
    "testing set. The score of a model in general depends on the way we make such a\n",
    "split. One downside of doing a single split is that it does not give any\n",
    "information about this variability. Another downside, in a setting where the\n",
    "amount of data is small, is that the data available for training and testing\n",
    "would be even smaller after splitting.\n",
    "\n",
    "Instead, we can use cross-validation. Cross-validation consists of repeating\n",
    "the procedure such that the training and testing sets are different each time.\n",
    "Generalization performance metrics are collected for each repetition and then\n",
    "aggregated. As a result we can assess the variability of our measure of the\n",
    "model's generalization performance.\n",
    "\n",
    "Note that there exists several cross-validation strategies, each of them\n",
    "defines how to repeat the `fit`/`score` procedure. In this section, we use the\n",
    "K-fold strategy: the entire dataset is split into `K` partitions. The\n",
    "`fit`/`score` procedure is repeated `K` times where at each iteration `K - 1`\n",
    "partitions are used to fit the model and `1` partition is used to score. The\n",
    "figure below illustrates this K-fold strategy.\n",
    "\n",
    "![Cross-validation diagram](../figures/cross_validation_diagram.png)\n",
    "\n",
    "<div class=\"admonition note alert alert-info\">\n",
    "<p class=\"first admonition-title\" style=\"font-weight: bold;\">Note</p>\n",
    "<p class=\"last\">This figure shows the particular case of <strong>K-fold</strong> cross-validation strategy.\n",
    "For each cross-validation split, the procedure trains a clone of model on all\n",
    "the red samples and evaluate the score of the model on the blue samples. As\n",
    "mentioned earlier, there is a variety of different cross-validation\n",
    "strategies. Some of these aspects will be covered in more detail in future\n",
    "notebooks.</p>\n",
    "</div>\n",
    "\n",
    "Cross-validation is therefore computationally intensive because it requires\n",
    "training several models instead of one.\n",
    "\n",
    "In scikit-learn, the function `cross_validate` allows to do cross-validation\n",
    "and you need to pass it the model, the data, and the target. Since there\n",
    "exists several cross-validation strategies, `cross_validate` takes a parameter\n",
    "`cv` which defines the splitting strategy."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%time\n",
    "from sklearn.model_selection import cross_validate\n",
    "\n",
    "model = make_pipeline(StandardScaler(), LogisticRegression())\n",
    "cv_result = cross_validate(model, data_numeric, target, cv=5)\n",
    "cv_result"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The output of `cross_validate` is a Python dictionary, which by default\n",
    "contains three entries:\n",
    "- (i) the time to train the model on the training data for each fold,\n",
    "  `fit_time`\n",
    "- (ii) the time to predict with the model on the testing data for each fold,\n",
    "  `score_time`\n",
    "- (iii) the default score on the testing data for each fold, `test_score`.\n",
    "\n",
    "Setting `cv=5` created 5 distinct splits to get 5 variations for the training\n",
    "and testing sets. Each training set is used to fit one model which is then\n",
    "scored on the matching test set. The default strategy when setting `cv=int` is\n",
    "the K-fold cross-validation where `K` corresponds to the (integer) number of\n",
    "splits. Setting `cv=5` or `cv=10` is a common practice, as it is a good\n",
    "trade-off between computation time and stability of the estimated variability.\n",
    "\n",
    "Note that by default the `cross_validate` function discards the `K` models\n",
    "that were trained on the different overlapping subset of the dataset. The goal\n",
    "of cross-validation is not to train a model, but rather to estimate\n",
    "approximately the generalization performance of a model that would have been\n",
    "trained to the full training set, along with an estimate of the variability\n",
    "(uncertainty on the generalization accuracy).\n",
    "\n",
    "You can pass additional parameters to\n",
    "[`sklearn.model_selection.cross_validate`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_validate.html)\n",
    "to collect additional information, such as the training scores of the models\n",
    "obtained on each round or even return the models themselves instead of\n",
    "discarding them. These features will be covered in a future notebook.\n",
    "\n",
    "Let's extract the scores computed on the test fold of each cross-validation\n",
    "round from the `cv_result` dictionary and compute the mean accuracy and the\n",
    "variation of the accuracy across folds."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "scores = cv_result[\"test_score\"]\n",
    "print(\n",
    "    \"The mean cross-validation accuracy is: \"\n",
    "    f\"{scores.mean():.3f} \u00b1 {scores.std():.3f}\"\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Note that by computing the standard-deviation of the cross-validation scores,\n",
    "we can estimate the uncertainty of our model generalization performance. This\n",
    "is the main advantage of cross-validation and can be crucial in practice, for\n",
    "example when comparing different models to figure out whether one is better\n",
    "than the other or whether our measures of the generalization performance of\n",
    "each model are within the error bars of one-another.\n",
    "\n",
    "In this particular case, only the first 2 decimals seem to be trustworthy. If\n",
    "you go up in this notebook, you can check that the performance we get with\n",
    "cross-validation is compatible with the one from a single train-test split."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Notebook recap\n",
    "\n",
    "In this notebook we assessed the generalization performance of our model via\n",
    "**cross-validation**."
   ]
  }
 ],
 "metadata": {
  "jupytext": {
   "main_language": "python"
  },
  "kernelspec": {
   "display_name": "Python 3",
   "name": "python3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}